# Multimodal QA

## Vilt Gqa Ft

A vision-language model based on the ViLT architecture, fine-tuned specifically for GQA visual reasoning tasks.

Author: phucd · Tags: Text-to-Image, Transformers · Downloads: 62 · Likes: 0
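ViLT checkpoints are served through dedicated classes in transformers. A minimal sketch of answering a question about an image, assuming this fine-tune keeps ViLT's standard question-answering head and that the Hub repo id matches the listing name (both are assumptions):

```python
# Minimal sketch: closed-vocabulary VQA with a ViLT checkpoint via transformers.
# ViltProcessor / ViltForQuestionAnswering are the documented ViLT classes;
# the repo id below is inferred from the listing and may differ on the Hub.
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

model_id = "phucd/vilt-gqa-ft"  # assumed repo id
processor = ViltProcessor.from_pretrained(model_id)
model = ViltForQuestionAnswering.from_pretrained(model_id)

image = Image.open("scene.jpg")  # placeholder image
encoding = processor(image, "How many people are in the picture?", return_tensors="pt")

outputs = model(**encoding)
answer_id = outputs.logits.argmax(-1).item()  # highest-scoring answer class
print(model.config.id2label[answer_id])
```

ViLT treats VQA as classification over a fixed answer vocabulary, which is why the answer comes from `id2label` rather than free-form generation.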
## VL Rethinker 7B 6bit

A multimodal model based on Qwen2.5-VL-7B-Instruct that supports visual question answering, converted to MLX format (6-bit quantization) for efficient operation on Apple silicon.

License: Apache-2.0 · Author: mlx-community · Tags: Text-to-Image, Transformers, English · Downloads: 19 · Likes: 0
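Conversions published by mlx-community are typically run with the mlx-vlm package. A minimal sketch, assuming `pip install mlx-vlm` and a repo id matching the listing name; the `generate` signature follows the mlx-vlm README and can shift between versions:

```python
# Minimal sketch: running the 6-bit MLX conversion on Apple silicon.
# Requires the mlx-vlm package; the repo id is inferred from the listing
# and may differ on the Hub.
from mlx_vlm import load, generate

model, processor = load("mlx-community/VL-Rethinker-7B-6bit")  # assumed repo id
output = generate(
    model,
    processor,
    prompt="What is happening in this picture?",
    image="photo.jpg",  # placeholder image path
    max_tokens=128,
)
print(output)
```

The 8-bit conversion listed next loads the same way; only the repo id (and the memory/quality trade-off) changes.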
## VL Rethinker 7B 8bit

VL-Rethinker-7B-8bit is a multimodal model based on Qwen2.5-VL-7B-Instruct, supporting visual question answering tasks.

License: Apache-2.0 · Author: mlx-community · Tags: Text-to-Image, Transformers, English · Downloads: 21 · Likes: 0
## Tinyllava Video Qwen2.5 3B Group 16 512

TinyLLaVA-Video is a video understanding model based on Qwen2.5-3B and siglip-so400m-patch14-384, using a grouped resampler to process video frames.

License: Apache-2.0 · Author: Zhang199 · Tags: Video-to-Text · Downloads: 76 · Likes: 0
## Videochat Flash Qwen2 7B Res224

A multimodal model built on UMT-L and Qwen2-7B, supporting long video understanding with only 16 tokens per frame and an extended context window of 128k.

License: Apache-2.0 · Author: OpenGVLab · Tags: Video-to-Text, Transformers, English · Downloads: 80 · Likes: 6
## Lava Phi

A vision-language model based on Microsoft's Phi-1.5 architecture, combined with CLIP for image processing capabilities.

License: MIT · Author: sagar007 · Tags: Image-to-Text, Transformers, Multilingual · Downloads: 17 · Likes: 0
## Idefics3 8B Llama3

Idefics3 is an open-source multimodal model capable of processing arbitrary sequences of image and text inputs to generate text outputs. It shows significant improvements in OCR, document understanding, and visual reasoning.

License: Apache-2.0 · Author: HuggingFaceM4 · Tags: Image-to-Text, Transformers, English · Downloads: 45.86k · Likes: 277
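Both Idefics entries expose the standard transformers vision-to-sequence interface. A minimal sketch of single-image question answering, following the AutoModelForVision2Seq chat-template pattern documented for the Idefics family; the image path and question are placeholders:

```python
# Minimal sketch: image + text question answering with Idefics3 via transformers.
# Follows the AutoModelForVision2Seq pattern from the Idefics model cards;
# the input image is a placeholder.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceM4/Idefics3-8B-Llama3"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("invoice.png")  # placeholder document image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is the total amount on this invoice?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

Idefics2 below is loaded the same way; only the repo id changes.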
## Idefics2 8b

Idefics2 is an open-source multimodal model capable of accepting arbitrary sequences of image and text inputs to generate text outputs. It shows significant improvements in OCR, document understanding, and visual reasoning.

License: Apache-2.0 · Author: HuggingFaceM4 · Tags: Image-to-Text, Transformers, English · Downloads: 14.99k · Likes: 603
## Llava Phi2

Llava-Phi2 is a multimodal implementation based on Phi-2, combining vision and language processing capabilities and suited to image-text-to-text tasks.

License: MIT · Author: RaviNaik · Tags: Image-to-Text, Transformers, English · Downloads: 153 · Likes: 6
## Video Blip Opt 2.7b Ego4d

VideoBLIP is an enhanced version of BLIP-2 capable of processing video data, using OPT-2.7b as the language model backbone.

License: MIT · Author: kpyu · Tags: Video-to-Text, Transformers, English · Downloads: 429 · Likes: 16